Accessing Cloud Optimized GeoTIFF (COG) - S3 Direct Access

Summary

In this notebook, we will access data for the ECOSTRESS Tiled Land Surface Temperature and Emissivity Instantaneous L2 Global 70 m V002 data product. These data are archived and distributed as Cloud Optimized GeoTIFF (COG) files, one file for each variable.

We will access a single COG file, Land Surface Temperature (LST), from inside the AWS cloud (us-west-2 region, specifically) and load it into Python as an xarray dataarray. This approach leverages S3 native protocols for efficient access to the data.

Requirements

1. AWS instance running in us-west-2

NASA Earthdata Cloud data in S3 can be directly accessed via temporary credentials; this access is limited to requests made within the US West (Oregon) (code: us-west-2) AWS region.

2. Earthdata Login

An Earthdata Login account is required to access data from the NASA Earthdata system. Thus, to access NASA data, you need Earthdata Login. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.

3. netrc File

You will need a netrc file containing your NASA Earthdata Login credentials in order to execute the notebooks. A netrc file can be created manually within text editor and saved to your home directory. For additional information see: Authentication for NASA Earthdata.

Learning Objectives

  • how to retrieve temporary S3 credentials for in-region direct S3 bucket access
  • how to perform in-region direct access of ECOSTRESS Cloud Optimized geoTIFF (COG) files in S3
  • how to plot the data

Import Packages

import os
import requests 
import boto3
from osgeo import gdal
import rasterio as rio
from rasterio.session import AWSSession
import rioxarray
import xarray as xr
import geopandas
from shapely.geometry import Polygon
from shapely.ops import transform
import pyproj
from pyproj import Proj
import hvplot.xarray
import holoviews as hv
import geoviews as gv
gv.extension('bokeh', 'matplotlib')

Get Temporary AWS Credentials

Direct S3 access is achieved by passing NASA supplied temporary credentials to AWS so we can interact with S3 objects from applicable Earthdata Cloud buckets. For now, each NASA DAAC has different AWS credentials endpoints. Below are some of the credential endpoints to various DAACs:

s3_cred_endpoint = {
    'podaac':'https://archive.podaac.earthdata.nasa.gov/s3credentials',
    'gesdisc': 'https://data.gesdisc.earthdata.nasa.gov/s3credentials',
    'lpdaac':'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
    'ornldaac': 'https://data.ornldaac.earthdata.nasa.gov/s3credentials',
    'ghrcdaac': 'https://data.ghrc.earthdata.nasa.gov/s3credentials'
}

Create a function to make a request to an endpoint for temporary credentials. Remember, each DAAC has their own endpoint and credentials are not usable for cloud data from other DAACs.

def get_temp_creds(provider):
    return requests.get(s3_cred_endpoint[provider]).json()
temp_creds_req = get_temp_creds('lpdaac')
#temp_creds_req

Workspace Environment Setup

For this exercise, we are going to open up a context manager for the notebook using the rasterio.env module to store the required GDAL and AWS configurations we need to access the data in Earthdata Cloud. While the context manager is open (rio_env.__enter__()) we will be able to run the open or get data commands that would typically be executed within a with statement, thus allowing us to more freely interact with the data. We’ll close the context (rio_env.__exit__()) at the end of the notebook.

Create a boto3 Session object using your temporary credentials. This Session is used to pass credentials and configuration to AWS so we can interact wit S3 objects from applicable buckets.

session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'], 
                        aws_secret_access_key=temp_creds_req['secretAccessKey'],
                        aws_session_token=temp_creds_req['sessionToken'],
                        region_name='us-west-2')

GDAL environment variables must be configured to access COGs in Earthdata Cloud. Geospatial data access Python packages like rasterio and rioxarray depend on GDAL, leveraging GDAL’s “Virtual File Systems” to read remote files. GDAL has a lot of environment variables that control it’s behavior. Changing these settings can mean the difference being able to access a file or not. They can also have an impact on the performance.

rio_env = rio.Env(AWSSession(session),
                  GDAL_DISABLE_READDIR_ON_OPEN='TRUE',
                  GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'),
                  GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt'))
rio_env.__enter__()
<rasterio.env.Env at 0x7fcfee922fa0>

In this example we’re interested in the ECOSTRESS Tiled Land Surface Temperature and Emissivity data collection from NASA’s LP DAAC in Earthdata Cloud. Below we specify the S3 URL to the data asset in Earthdata Cloud. This URL can be found via Earthdata Search or programmatically through the CMR and CMR-STAC APIs.

#s3_url = 's3://lp-prod-protected/HLSL30.020/HLS.L30.T10SGD.2020272T183449.v2.0/HLS.L30.T10SGD.2020272T183449.v2.0.B04.tif'
s3_url_lst = 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_LST.tif'
s3_url_qa = 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_QC.tif'

Direct In-region Access

Read in the ECOSTRESS Tiles LST S3 URL into our workspace using rioxarray, an extension of xarray used to read geospatial data.

da = rioxarray.open_rasterio(s3_url_lst, chunks='auto')
da
<xarray.DataArray (band: 1, y: 1568, x: 1568)>
dask.array<open_rasterio-8869317cab2408c4770e6caee839802c<this-array>, shape=(1, 1568, 1568), dtype=float32, chunksize=(1, 1568, 1568), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2e+05 2.001e+05 2.002e+05 ... 3.096e+05 3.097e+05
  * y            (y) float64 3.9e+06 3.9e+06 3.9e+06 ... 3.79e+06 3.79e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    nan
    scale_factor:  1.0
    add_offset:    0.0
qa = rioxarray.open_rasterio(s3_url_qa, chunks='auto')
qa
<xarray.DataArray (band: 1, y: 1568, x: 1568)>
dask.array<open_rasterio-cbe9aac41d32a53ce5e8a6acfb1a9f2b<this-array>, shape=(1, 1568, 1568), dtype=uint16, chunksize=(1, 1568, 1568), chunktype=numpy.ndarray>
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2e+05 2.001e+05 2.002e+05 ... 3.096e+05 3.097e+05
  * y            (y) float64 3.9e+06 3.9e+06 3.9e+06 ... 3.79e+06 3.79e+06
    spatial_ref  int64 0
Attributes:
    scale_factor:  1.0
    add_offset:    0.0
LST_dataset = xr.Dataset({'LST': da, 'quality': qa})
LST_dataset
<xarray.Dataset>
Dimensions:      (band: 1, x: 1568, y: 1568)
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 2e+05 2.001e+05 2.002e+05 ... 3.096e+05 3.097e+05
  * y            (y) float64 3.9e+06 3.9e+06 3.9e+06 ... 3.79e+06 3.79e+06
    spatial_ref  int64 0
Data variables:
    LST          (band, y, x) float32 dask.array<chunksize=(1, 1568, 1568), meta=np.ndarray>
    quality      (band, y, x) uint16 dask.array<chunksize=(1, 1568, 1568), meta=np.ndarray>

The file is read into Python as an xarray dataarray with a band, x, and y dimension. In this example the band dimension is meaningless, so we’ll use the squeeze() function to remove band as a dimension.

da_lst = LST_dataset.squeeze('band', drop=True)
da_lst
<xarray.Dataset>
Dimensions:      (x: 1568, y: 1568)
Coordinates:
  * x            (x) float64 2e+05 2.001e+05 2.002e+05 ... 3.096e+05 3.097e+05
  * y            (y) float64 3.9e+06 3.9e+06 3.9e+06 ... 3.79e+06 3.79e+06
    spatial_ref  int64 0
Data variables:
    LST          (y, x) float32 dask.array<chunksize=(1568, 1568), meta=np.ndarray>
    quality      (y, x) uint16 dask.array<chunksize=(1568, 1568), meta=np.ndarray>

Plot the dataarray, representing the LST, using hvplot.

da_lst['LST'].hvplot.image(x = 'x', y = 'y', crs = 'EPSG:32610', cmap='jet', rasterize=False, tiles='EsriImagery', width=800, height=600, colorbar=True, title = 'Land Surface Temperature')

Define the Region of Interest for Clipping

We’ll read in our GeoJSON file of our points of interest and create bounding box that contains a points coordinates

field = geopandas.read_file('../../tutorials/landcover.geojson')

Extract the min/max values for the y and x axis

minx, miny, maxx, maxy = field.geometry.total_bounds
minx, miny, maxx, maxy
(-119.54552918448296,
 34.39556000406682,
 -119.52621487396785,
 34.40555139039825)

Order the coordinates for the bounding box counterclockwise

coords = [
    (minx, miny),
    (maxx, miny),
    (maxx, maxy),
    (minx, maxy)
]

Create a shapely polygon

feature_shape = Polygon(coords)
feature_shape

base = gv.tile_sources.EsriImagery.opts(width=700, height=500)
farmField = gv.Polygons(feature_shape).opts(line_color='yellow', line_width=10, color=None)
base * farmField 

Let’s take a look at the bounding coordinate values.

Note, the values above are in decimal degrees and represent the longitude and latitude for the lower left corner and upper right corner respectively.

feature_shape.bounds
(-119.54552918448296,
 34.39556000406682,
 -119.52621487396785,
 34.40555139039825)

Get the projection information from the ECOSTRESS file

src_proj = da_lst.rio.crs
src_proj
CRS.from_epsg(32611)

Transform coordinates from lat lon (units = dd) to UTM (units = m)

geo_CRS = Proj('+proj=longlat +datum=WGS84 +no_defs', preserve_units=True)   # Source coordinate system of the ROI
project = pyproj.Transformer.from_proj(geo_CRS, src_proj)                    # Set up the transformation
fsUTM = transform(project.transform, feature_shape)
fsUTM.bounds
(265993.0454523829, 3808909.50873117, 267796.62994798424, 3810062.237028728)

The coordinates for our feature have now been converted to source raster projection. Note the difference in the values between feature_shape.bounds (in geographic) and fsUTM.bounds (in UTM projection).

Now we can clip our ECOSTRESS LST file to our region of insterest!

Access and clip the ECOSTRESS LST COG

We can now use our transformed ROI bounding box to clip the ECOSTRESS S3 object we accessed before. We’ll use the rio.clip

da_lst_clip = rioxarray.open_rasterio(s3_url_lst, chunks='auto').squeeze('band', drop=True).rio.clip([fsUTM])
da_lst_clip
<xarray.DataArray (y: 16, x: 26)>
dask.array<copy, shape=(16, 26), dtype=float32, chunksize=(16, 26), chunktype=numpy.ndarray>
Coordinates:
  * y            (y) float64 3.81e+06 3.81e+06 3.81e+06 ... 3.809e+06 3.809e+06
  * x            (x) float64 2.66e+05 2.661e+05 ... 2.677e+05 2.678e+05
    spatial_ref  int64 0
Attributes:
    scale_factor:  1.0
    add_offset:    0.0
    _FillValue:    nan
da_lst_clip.hvplot.image(x = 'x', y = 'y', crs = 'EPSG:32610', cmap='jet', rasterize=False, width=800, height=600,title = 'Land Surface Temperature (Kelvin)', colorbar=True)

Exit the context manager.

rio_env.__exit__()

Resources

Direct S3 Data Access with rioxarray

Direct_S3_Access__gdalvrt

Direct_S3_Access__rioxarray_clipping

Getting Started with Cloud-Native Harmonized Landsat Sentinel-2 (HLS) Data in R